INN Hotels Project

Context

A significant number of hotel bookings are called off due to cancellations or no-shows. Typical reasons include a change of plans, scheduling conflicts, etc. Cancelling is often made easier by the option to do so free of charge or at a low cost, which is convenient for hotel guests but a less desirable, potentially revenue-diminishing factor for hotels to deal with. Such losses are particularly high for last-minute cancellations.

New technologies involving online booking channels have dramatically changed customers' booking possibilities and behavior. This adds a further dimension to the challenge of how hotels handle cancellations, which are no longer limited to traditional booking and guest characteristics.

The cancellation of bookings impacts a hotel on various fronts:

Objective

The increasing number of cancellations calls for a machine learning based solution that can help predict which bookings are likely to be canceled. INN Hotels Group, a chain of hotels in Portugal, is facing problems with a high number of booking cancellations and has reached out to your firm for data-driven solutions. As a data scientist, you have to analyze the data provided to find which factors have a high influence on booking cancellations, build a predictive model that can predict in advance which bookings will be canceled, and help formulate profitable policies for cancellations and refunds.

Data Description

The data contains different attributes of customers' booking details. The detailed data dictionary is given below.

Data Dictionary

Importing necessary libraries and data
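A minimal sketch of the loading step. In the notebook the data is read from a CSV file; here a tiny synthetic frame stands in, and the column names (taken from the data dictionary) are assumptions.

```python
import pandas as pd

# Synthetic stand-in for the CSV read in the notebook
# (e.g. data = pd.read_csv("<path-to-file>")); columns follow the data dictionary.
data = pd.DataFrame({
    "no_of_adults": [2, 1, 2],
    "arrival_month": [10, 6, 12],
    "avg_price_per_room": [95.0, 60.5, 120.0],
    "booking_status": ["Not_Canceled", "Canceled", "Not_Canceled"],
})

print(data.shape)   # (rows, columns)
print(data.dtypes)  # column types help separate numeric from categorical
```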

The shape of the dataset.

Observations

Summary of the dataset

Observation

Checking for Categorical columns
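One common way to list the categorical candidates is to filter by dtype; a sketch with an assumed two-column frame:

```python
import pandas as pd

df = pd.DataFrame({
    "market_segment_type": ["Online", "Offline", "Online"],  # assumed column name
    "lead_time": [30, 5, 120],
})

# Object/category dtype columns are the categorical candidates.
cat_cols = df.select_dtypes(include=["object", "category"]).columns.tolist()
for col in cat_cols:
    print(df[col].value_counts())
```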

Observations

Exploratory Data Analysis (EDA)

Questions:

  1. What are the busiest months in the hotel?
  2. Which market segment do most of the guests come from?
  3. Hotel rates are dynamic and change according to demand and customer demographics. What are the differences in room prices in different market segments?
  4. What percentage of bookings are canceled?
  5. Repeating guests are the guests who stay in the hotel often and are important to brand equity. What percentage of repeating guests cancel?
  6. Many guests have special requirements when booking a hotel room. Do these requirements affect booking cancellation?

1. What are the busiest months in the hotel?
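A frequency count of the arrival month answers this; `arrival_month` (1-12) is assumed from the data dictionary, and the values here are a toy sample:

```python
import pandas as pd

df = pd.DataFrame({"arrival_month": [10, 10, 6, 10, 12, 6]})

# value_counts sorts by count descending, so the busiest month comes first.
monthly = df["arrival_month"].value_counts()
print(monthly)
```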

Observation

2. Which market segment do most of the guests come from?

Observation

Let's find out which market segment of customers stays longest in the hotel.

3. Hotel rates are dynamic and change according to demand and customer demographics. What are the differences in room prices in different market segments?

Observations

4. What percentage of bookings are canceled?
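The cancellation percentage is a normalized value count on the target; a sketch on toy labels (the real notebook applies this to the full `booking_status` column):

```python
import pandas as pd

df = pd.DataFrame({
    "booking_status": ["Canceled", "Not_Canceled", "Not_Canceled", "Canceled"],
})

# normalize=True gives proportions; multiply by 100 for percentages.
pct = df["booking_status"].value_counts(normalize=True) * 100
print(pct)
```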

Observation

Let's see which year had the most canceled bookings.

5. Repeating guests are the guests who stay in the hotel often and are important to brand equity. What percentage of repeating guests cancel?

Observation

Let's find out how repeated guests behave with respect to other variables in the dataset.

6. Many guests have special requirements when booking a hotel room. Do these requirements affect booking cancellation?
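A row-normalized crosstab of special requests against booking status shows whether cancellation rates drop as requests increase; `no_of_special_requests` is assumed from the data dictionary and the values are synthetic:

```python
import pandas as pd

df = pd.DataFrame({
    "no_of_special_requests": [0, 0, 1, 2, 0, 3],
    "booking_status": ["Canceled", "Canceled", "Not_Canceled",
                       "Not_Canceled", "Not_Canceled", "Not_Canceled"],
})

# normalize="index" gives, per request count, the share of each booking status.
ct = pd.crosstab(df["no_of_special_requests"], df["booking_status"],
                 normalize="index")
print(ct)
```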

Observation

More EDA

Observation

Weekend nights vs Booking status

Summary of EDA

Data Description

Observation from EDA

Data Preprocessing

Checking for missing values
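A per-column null count is the usual first check; sketched on a toy frame with assumed column names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "lead_time": [10, np.nan, 30],
    "avg_price_per_room": [100.0, 80.0, np.nan],
})

# isnull().sum() counts missing entries per column.
missing = df.isnull().sum()
print(missing)
```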

Outliers Detection and treatment
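A standard approach is the IQR rule: values beyond 1.5 × IQR from the quartiles are flagged. A sketch on a toy price series (the notebook's actual treatment of flagged values may differ):

```python
import pandas as pd

s = pd.Series([90, 95, 100, 105, 110, 400])  # 400 is an obvious outlier

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the whiskers are treated as outliers.
outliers = s[(s < lower) | (s > upper)]
print(outliers.tolist())
```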

Building a Logistic Regression model

Objective

**We want to predict the booking status. Thus booking_status is the dependent variable.**

We'll split the data into train and test to be able to evaluate the model that we build on the train data.
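A sketch of the split with `train_test_split`; the 70/30 ratio, the stratification, and the toy data are assumptions, not the notebook's exact settings:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

X = pd.DataFrame({"lead_time": range(10), "no_of_special_requests": [0, 1] * 5})
y = pd.Series([0, 1] * 5, name="booking_status")

# Stratify on y so both splits keep the same cancellation ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)
print(X_train.shape, X_test.shape)
```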

We will build a Logistic Regression model using the train data and then check its performance.

Checking Multicollinearity

Let's check VIF (Variance Inflation Factor) in the Training data.

Let's check VIF (Variance Inflation Factor) in the Test data.

Building a Logistic Regression model

First, let's create functions to calculate different metrics and confusion matrix so that we don't have to use the same code repeatedly for each model.
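One possible shape for such a helper, using scikit-learn's metric functions; the function name and the one-row-DataFrame layout are assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

def model_performance(y_true, y_pred):
    """Return the common classification metrics as a one-row frame."""
    return pd.DataFrame({
        "Accuracy":  [accuracy_score(y_true, y_pred)],
        "Recall":    [recall_score(y_true, y_pred)],
        "Precision": [precision_score(y_true, y_pred)],
        "F1":        [f1_score(y_true, y_pred)],
    })

y_true = np.array([1, 0, 1, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1])
perf = model_performance(y_true, y_pred)
print(perf)
print(confusion_matrix(y_true, y_pred))
```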

Logistic Regression

Logistic Regression (with statsmodels library) in the train data

Now no feature has a p-value greater than 0.05, so we'll consider the features in X_train8 as the final ones and lg1 as the final model.

Coefficient interpretations

Checking model performance on the training set

ROC-AUC
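ROC-AUC summarizes ranking quality across all thresholds: it is the probability that a randomly chosen canceled booking gets a higher predicted score than a randomly chosen non-canceled one. A tiny worked example with toy scores:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# 3 of the 4 (negative, positive) pairs are ranked correctly -> AUC = 0.75.
print(roc_auc_score(y_true, scores))
```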

Checking model performance on training set

Let's use Precision-Recall curve and see if we can find a better threshold
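A sketch of threshold selection from the precision-recall curve, here picking the threshold where F1 peaks (the notebook may use a different criterion); the toy scores are assumptions:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.9, 0.5])

precisions, recalls, thresholds = precision_recall_curve(y_true, scores)

# precisions/recalls have one extra trailing point (P=1, R=0), so drop it
# before aligning with thresholds; then take the F1-maximizing threshold.
f1 = 2 * precisions * recalls / (precisions + recalls + 1e-12)
best = thresholds[np.argmax(f1[:-1])]
print(best)
```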

Checking model performance on training set

Model Performance Summary

Let's check the performance on the test set

Using model with default threshold

ROC curve on test set

Using model with threshold=0.76

Using model with threshold = 0.58

Model performance summary

Model Training

Model performance evaluation

The model can make wrong predictions in two ways:

  1. Predicting that the guest did Not Cancel the booking but in reality the guest Canceled the booking.

  2. Predicting that the guest Canceled the booking but in reality the guest did Not Cancel the booking.

Which case is more important?

How can we reduce this loss, i.e., reduce False Negatives?

First, let's create functions to calculate different metrics and confusion matrix so that we don't have to use the same code repeatedly for each model.

Build Decision Tree Model

We will build our model using the DecisionTreeClassifier function, with the default 'gini' criterion to split. If the frequency of class A is 10% and the frequency of class B is 90%, then class B will become the dominant class and the decision tree will become biased toward it.

In this case, we can pass a dictionary {0: 0.15, 1: 0.85} to the model to specify the weight of each class, and the decision tree will give more weight to class 1.

class_weight is a hyperparameter for the decision tree classifier.
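A sketch of the weighted tree on synthetic data; the class weights follow the dictionary above, while the toy features, the seed, and the `max_depth` cap are assumptions:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
# Imbalanced synthetic target driven mostly by the first feature.
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0.8).astype(int)

# class_weight tilts the impurity calculation toward the minority class 1.
tree = DecisionTreeClassifier(class_weight={0: 0.15, 1: 0.85},
                              max_depth=3, random_state=1)
tree.fit(X, y)
print(tree.score(X, y))
```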

Checking model performance on training set

Checking model performance on test set

Visualizing the Decision Tree

Reducing overfitting

Checking performance on training set

Checking performance on test set

Visualizing the Decision Tree

Observations from the tree:

Cost Complexity Pruning

Total impurity of leaves vs effective alphas of pruned tree
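`cost_complexity_pruning_path` returns the effective alphas and the total leaf impurity of the pruned tree at each alpha; fitting one tree per alpha traces how the tree shrinks. A sketch on synthetic data (the real notebook uses the hotel features):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
tree = DecisionTreeClassifier(random_state=0)

# Effective alphas are returned in increasing order; larger alpha
# means heavier pruning (smaller subtree).
path = tree.cost_complexity_pruning_path(X, y)
ccp_alphas, impurities = path.ccp_alphas, path.impurities

trees = [DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X, y)
         for a in ccp_alphas]
# The largest alpha prunes all the way down to the root node.
print(len(ccp_alphas), trees[0].get_n_leaves(), trees[-1].get_n_leaves())
```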

Checking performance on training set

Checking performance on test set

Visualizing the Decision Tree

Creating model with 0.002 ccp_alpha

Checking performance on the training set

Checking performance on the test set

Visualizing the Decision Tree

Comparing all the decision tree models

Conclusions

Recommendations